Mining Statistically Significant Patterns using the Chi-Square Statistic
Abstract
Statistical significance is used to ascertain whether the outcome of a given experiment can be ascribed to some extraneous factors or is solely due to chance. An observed pattern of events is deemed statistically significant if it is unlikely to have occurred due to randomness or chance alone. In this thesis, we study the problem of identifying statistically significant patterns in strings and graphs, using the chi-square statistic as a quantitative measure of statistical significance. The problem of identifying statistically significant patterns in a sequence of data has been applied to intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, text analysis, etc. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substrings that differ from the expected or normal behavior. The most significant substring (MSS) of a string is the one having the highest chi-square statistic. To date, to the best of our knowledge, no algorithm for finding the most significant substring or the top-t significant substrings does better than O(n^2) time, where n denotes the length of the string. In this thesis, we propose an algorithm to find the MSS whose expected running time is O(n). For any constant t, we extend this algorithm to find the top-t significant substrings, again in expected O(n) time. We derive bounds on the chi-square statistic value of the MSS, which provide the basis for verifying the null hypothesis. We also propose a novel algorithm for the verification of the null hypothesis. Further, we study the problem of finding statistically significant subgraphs in an undirected, unweighted graph. The nodes of the graph are labeled from an alphabet of size k, where each symbol is drawn from a fixed probability distribution P. The problem has impor-
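To make the definition of the MSS concrete, the following is a minimal sketch of a brute-force scan that scores every one of the O(n^2) substrings with the Pearson chi-square statistic, X^2 = sum over symbols of (observed count - l * p)^2 / (l * p), where l is the substring length and p the symbol's probability under the memoryless model. It is only an illustrative baseline, not the algorithm proposed in the thesis. The function names `chi_square` and `most_significant_substring` and the dictionary representation of the symbol probabilities are assumptions made for this example.

```python
def chi_square(counts, length, probs):
    """Pearson chi-square of observed symbol counts in a substring of the
    given length, against the expected counts length * p for each symbol."""
    return sum(
        (counts[symbol] - length * p) ** 2 / (length * p)
        for symbol, p in probs.items()
    )


def most_significant_substring(text, probs):
    """Brute-force baseline (hypothetical helper, not the thesis algorithm).

    Scans all substrings text[start:end] and returns (score, start, end)
    for the first substring attaining the highest chi-square statistic.
    Every character of text must be a key of probs.
    """
    best = (0.0, 0, 0)
    for start in range(len(text)):
        # Running symbol counts for substrings that begin at `start`.
        counts = dict.fromkeys(probs, 0)
        for end in range(start, len(text)):
            counts[text[end]] += 1
            length = end - start + 1
            score = chi_square(counts, length, probs)
            if score > best[0]:
                best = (score, start, end + 1)
    return best


if __name__ == "__main__":
    # Two equally likely symbols; the run of four b's deviates the most.
    score, start, end = most_significant_substring(
        "aabbbbab", {"a": 0.5, "b": 0.5}
    )
    print(score, start, end)
```

With two equally likely symbols the statistic reduces to (count of a - count of b)^2 / l, so a run of identical symbols of length m scores exactly m. This is why the scan above selects the run of four b's in the sample string.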
Similar resources
Network Anomalies Detection Using Statistical Technique: A Chi-Square Approach
An intrusion detection system, which is used to detect suspicious activities, is one form of defense. However, the sheer size of network logs makes human log analysis intractable. Furthermore, traditional intrusion detection methods based on pattern matching techniques cannot cope with the need for faster speed to manually update those patterns. Anomaly detection is used as a part of the intrusion det...
Mining Statistically Significant Substrings using the Chi-Square Statistic
The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to ra...
Chi-squared computation for association rules: preliminary results
Chi squared analysis is useful in determining the statistical significance level of association rules. We show that the chi squared statistic of a rule may be computed directly from the values of confidence, support, and lift (interest) of the rule in question. Our results facilitate pruning of rule sets obtained using standard association rule mining techniques, allow identification of statist...
Maximally selected chi-square statistics and umbrella orderings
Binary outcomes that depend on an ordinal predictor in a nonmonotonic way are common in medical data analysis. Such patterns can be addressed in terms of cutpoints: for example, one looks for two cutpoints that define an interval in the range of the ordinal predictor for which the probability of a positive outcome is particularly high (or low). A chi-square test may then be performed to compare...
Maximally selected chi-square statistics and non-monotonic associations: an exact approach based on two cutpoints
Binary outcomes that depend on an ordinal predictor in a non-monotonic way are common in medical data analysis. Such patterns can be addressed in terms of cutpoints: for example, one looks for two cutpoints that define an interval in the range of the ordinal predictor for which the probability of a positive outcome is particularly high (or low). A chi-square test may then be performed to compar...